Q1
Best subset selection. For each model size it searches over all possible models, including every model that forward or backward stepwise could choose, so its training RSS is at least as small.
Any of the three is possible. Best subset minimizes training RSS, which may overfit, so a stepwise method can happen to select a model with lower test RSS.
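An illustrative check of (a) (simulated data with the leaps package; not part of the original exercise): best subset's training RSS is never above forward stepwise's at any model size.

library(leaps)
set.seed(1)
Xs=matrix(rnorm(100*6),ncol=6)
ys=drop(Xs%*%rnorm(6))+rnorm(100)
dat=data.frame(Xs,ys)
full=summary(regsubsets(ys~.,dat,method='exhaustive'))
fwd=summary(regsubsets(ys~.,dat,method='forward'))
full$rss<=fwd$rss    # TRUE for every model size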
i True. Forward stepwise obtains the (k+1)-variable model by adding one predictor to the k-variable model.
ii True. Backward stepwise obtains the k-variable model by removing one predictor from the (k+1)-variable model.
iii False. The forward and backward paths need not be nested within each other.
iv False. Same reason as (iii).
v False. Best subset searches all models afresh at each size, so consecutive sizes need not be nested.
Choose iii. The lasso penalty restricts the flexibility of the linear model, so it typically decreases variance and increases bias; prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
Choose iii. Ridge regression is also less flexible than least squares, for the same reason as in (a).
Choose ii. Non-linear methods are more flexible than least squares, so they improve prediction accuracy when the increase in variance is smaller than the decrease in bias.
Choose iv. On the training set, increasing s is equivalent to increasing the flexibility of the model, so the training RSS steadily decreases.
Choose ii. On the test set, the RSS decreases initially and then starts increasing once the model begins to overfit.
Choose iii. As the flexibility increases, the variance steadily increases.
Choose iv. As the flexibility increases, the (squared) bias steadily decreases.
Choose v. The irreducible error is constant and does not depend on s.
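As an illustrative check (simulated data with glmnet; not part of the original exercise), the training RSS along a lasso path is non-increasing as the penalty relaxes, i.e. as \(\lambda\) decreases and s grows:

library(glmnet)
set.seed(1)
xsim=matrix(rnorm(100*5),ncol=5)
ysim=drop(xsim%*%c(3,-2,1,0,0))+rnorm(100)
fit=glmnet(xsim,ysim,alpha=1)                 # lambda decreases along the path
rss=colSums((ysim-predict(fit,newx=xsim))^2)  # training RSS at each lambda
plot(fit$lambda,rss,type='l',xlab='lambda',ylab='training RSS')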
Increasing \(\lambda\) from 0 decreases the flexibility of the model.
Choose iii. The training RSS steadily increases as the model becomes less flexible.
Choose ii. The test RSS decreases at first, then increases once the model becomes too inflexible and underfits.
Choose iv. The variance steadily decreases with decreasing flexibility.
Choose iii. The (squared) bias steadily increases with decreasing flexibility.
Choose v. The irreducible error remains constant regardless of \(\lambda\).
\[ \min_{\beta} \quad \sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)^2+ \lambda\sum_{j=1}^2\beta_j^2 \]
We can obtain \(\hat\beta_1\) and \(\hat\beta_2\) by setting the first derivatives of this objective equal to 0.
Let \(x_{11}=x_{12}=x_1\) and \(x_{21}=x_{22}=x_2\). \[ L=\sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)^2+ \lambda\sum_{j=1}^2\beta_j^2\\ \frac{\partial L}{\partial \beta_1}=2\sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)(-x_{i1})+2\lambda\beta_1=0\\ \rightarrow \quad \hat\beta_1=\frac{\sum_{i=1}^2y_ix_i-\hat\beta_2\sum_{i=1}^2x_i^2}{\lambda+\sum_{i=1}^{2}x_i^2} \quad (1) \] Also, we can get the estimate of \(\beta_2\) in a similar way: \[ \hat\beta_2=\frac{\sum_{i=1}^2y_ix_i-\hat\beta_1\sum_{i=1}^2x_i^2}{\lambda+\sum_{i=1}^{2}x_i^2} \quad (2) \] Subtracting (2) from (1) gives \(\lambda(\hat\beta_1-\hat\beta_2)=0\), so for \(\lambda>0\): \[ \hat{\beta_1}=\hat{\beta_2} \]
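A quick numerical sketch of this result (made-up data; glmnet with two identical predictors):

library(glmnet)
set.seed(1)
x1=rnorm(50)
Xd=cbind(x1,x1)                                   # x_{i1} = x_{i2}
yd=2*x1+rnorm(50)
ridge.check=glmnet(Xd,yd,alpha=0,lambda=1,intercept=FALSE)
coef(ridge.check)                                 # the two slope estimates coincide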
\[ \min_{\beta} \quad \sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)^2+ \lambda\sum_{j=1}^2|\beta_j| \]
Assume that \(\beta_1,\beta_2\neq 0\), and again let \(x_{i1}=x_{i2}=x_i\). \[ L=\sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)^2+ \lambda\sum_{j=1}^2|\beta_j|\\ \frac{\partial L}{\partial \beta_1}=2\sum_{i=1}^2\Big(y_i-\sum_{j=1}^2\beta_jx_{ij}\Big)(-x_{i})+\lambda\frac{|\beta_1|}{\beta_1}=0\\ \rightarrow \quad 2(\hat\beta_1+\hat\beta_2)\sum_{i=1}^2x_i^2+\lambda\frac{|\hat\beta_1|}{\hat\beta_1}=2\sum_{i=1}^2y_ix_i \quad (1) \] Similarly, \[ \frac{\partial L}{\partial \beta_2}=0 \quad\rightarrow\quad 2(\hat\beta_1+\hat\beta_2)\sum_{i=1}^2x_i^2+\lambda\frac{|\hat\beta_2|}{\hat\beta_2}=2\sum_{i=1}^2y_ix_i \quad (2) \] The right-hand sides of (1) and (2) are equal, so subtracting one from the other gives \[ \frac{|\hat{\beta_1}|}{\hat{\beta_1}}=\frac{|\hat{\beta_2}|}{\hat{\beta_2}}. \] The stationarity conditions therefore pin down only the sign of the coefficients and the sum \(\hat\beta_1+\hat\beta_2\); any pair with that sum and matching signs is a solution, so the lasso solution is not unique.
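This can be checked directly (a small sketch with made-up numbers): splits of the same total with matching signs give identical objective values.

set.seed(1)
xs=rnorm(2)
ys=2*xs+rnorm(2)
lasso_obj=function(b1,b2,lam=1) sum((ys-(b1+b2)*xs)^2)+lam*(abs(b1)+abs(b2))
c(lasso_obj(0.6,0.4),lasso_obj(0.3,0.7),lasso_obj(1.0,0.0))   # all equal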
Let \(y=8\) and \(\lambda=5\).
beta=seq(-10,10,0.05)
ridge=(8-beta)^2+5*beta^2    # ridge objective (y - beta)^2 + lambda * beta^2
plot(beta,ridge,main='ridge regression')
points(4/3,(8-4/3)^2+5*(4/3)^2,col='blue',pch=15)    # mark the minimizer y/(1+lambda)=4/3
It can be seen that \(\hat{\beta}^R_j=y_j/(1+\lambda)=8/6=4/3\) minimizes the ridge objective.
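The minimizer can also be confirmed numerically on the grid:

beta[which.min(ridge)]    # ~1.35, the grid point closest to 4/3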
Let \(y=8\) and \(\lambda=5\).
beta=seq(-10,10,0.05)
lasso=(8-beta)^2+5*abs(beta)    # lasso objective (y - beta)^2 + lambda * |beta|
plot(beta,lasso,main='lasso regression')
points(5.5,(8-5.5)^2+5*(5.5),col='blue',pch=15)    # mark the minimizer y - lambda/2 = 5.5
It can be seen that \(\hat{\beta}^L_j=y_j-\lambda/2=8-5/2=5.5\) minimizes the lasso objective (valid here since \(y_j>\lambda/2\)).
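Again the grid minimum agrees:

beta[which.min(lasso)]    # 5.5, matching y - lambda/2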
\[ \begin{aligned} \text{Likelihood}&=P(\mathbf{y}\mid\beta)\\ &=\prod_{i=1}^n P(y_i\mid\beta,\sigma^2)\\ &=\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\Big(-\frac{(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}\Big)\\ &=\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\exp\Big(-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}\Big) \end{aligned} \]
\[ \begin{aligned} \text{Posterior}&\propto\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\exp\Big(-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}\Big) \prod_{j=1}^p\frac{1}{2b}\exp\Big(-\frac{|\beta_j|}{b}\Big)\\ &=\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\Big(\frac{1}{2b}\Big)^p\exp\Big(-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}-\sum_{j=1}^p\frac{|\beta_j|}{b}\Big) \end{aligned} \]
Since maximizing the log of an expression is equivalent to maximizing the expression itself, we can take the log of both sides. \[ \log{\text{Posterior}}=\log\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\Big(\frac{1}{2b}\Big)^p-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}-\sum_{j=1}^p\frac{|\beta_j|}{b}+\text{const.} \] \[ \begin{aligned} \mathop{argmax}\limits_{\beta}\quad \text{Posterior}&=\mathop{argmax}\limits_{\beta}\quad -\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}-\sum_{j=1}^p\frac{|\beta_j|}{b}\\ &=\mathop{argmin}\limits_{\beta}\quad \frac{1}{2\sigma^2}\Big(\sum_{i=1}^n\big(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j\big)^2+\sum_{j=1}^p\frac{2\sigma^2}{b}|\beta_j|\Big)\\ &=\mathop{argmin}\limits_{\beta}\quad \sum_{i=1}^n\big(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j\big)^2+\lambda\sum_{j=1}^p|\beta_j|\\ &\quad\text{by letting}\quad\lambda=\frac{2\sigma^2}{b} \end{aligned} \] This is the lasso criterion, so the lasso estimate is the mode of the posterior.
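A numerical sanity check (a one-predictor sketch; \(\sigma^2\) and \(b\) are made-up constants): the minimizer of the negative log posterior coincides with the minimizer of the lasso criterion with \(\lambda=2\sigma^2/b\).

set.seed(1)
n=100; sigma2=1; b=0.5
xs=rnorm(n); ys=1.5*xs+rnorm(n,sd=sqrt(sigma2))
neg_log_post=function(beta) sum((ys-beta*xs)^2)/(2*sigma2)+abs(beta)/b    # up to constants
lasso_crit=function(beta) sum((ys-beta*xs)^2)+(2*sigma2/b)*abs(beta)
c(optimize(neg_log_post,c(-10,10))$minimum,optimize(lasso_crit,c(-10,10))$minimum)   # equal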
\[ \begin{aligned} \text{Posterior}&\propto\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\Big(-\frac{(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}\Big) \prod_{j=1}^p\frac{1}{\sqrt{2\pi c}}\exp\Big(-\frac{\beta_j^2}{2c}\Big)\\ &=\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\exp\Big(-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2}\Big) \Big(\frac{1}{\sqrt{2\pi c}}\Big)^p\exp\Big(-\frac{\sum_{j=1}^p\beta_j^2}{2c}\Big) \end{aligned} \]
First, take the log of both sides: \[ \log \text{Posterior}=\log\Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n\Big(\frac{1}{\sqrt{2\pi c}}\Big)^p-\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2} -\frac{\sum_{j=1}^p\beta_j^2}{2c}+\text{const.} \] \[ \begin{aligned} \mathop{argmax}\limits_{\beta} \quad \text{Posterior}&=\mathop{argmax}\limits_{\beta} \quad -\frac{\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2}{2\sigma^2} -\frac{\sum_{j=1}^p\beta_j^2}{2c}\\ &=\mathop{argmin}\limits_{\beta}\quad\frac{1}{2\sigma^2}\Big(\sum_{i=1}^n\big(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j\big)^2+\frac{\sigma^2}{c}\sum_{j=1}^p\beta_j^2\Big)\\ &=\mathop{argmin}\limits_{\beta}\quad\sum_{i=1}^n\big(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j\big)^2+\lambda\sum_{j=1}^p\beta_j^2\\ &\quad\text{by letting} \quad \lambda=\frac{\sigma^2}{c} \end{aligned} \] This is the ridge regression criterion, so the ridge estimate is the mode of the posterior; since this posterior is Gaussian, the mode is also the mean.
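The same style of check works for the ridge case with \(\lambda=\sigma^2/c\) (again made-up constants):

set.seed(1)
n=100; sigma2=1; cc=0.5
xs=rnorm(n); ys=3*xs+rnorm(n,sd=sqrt(sigma2))
neg_log_post=function(beta) sum((ys-beta*xs)^2)/(2*sigma2)+beta^2/(2*cc)
ridge_crit=function(beta) sum((ys-beta*xs)^2)+(sigma2/cc)*beta^2
c(optimize(neg_log_post,c(-10,10))$minimum,optimize(ridge_crit,c(-10,10))$minimum)   # equal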
set.seed(1)
x=rnorm(100)               # predictor X
epi=rnorm(100)             # noise term epsilon
Y=1+2*x+3*x^2+4*x^3+epi    # true model: beta0=1, beta1=2, beta2=3, beta3=4
library(leaps)
new_x=data.frame(x,x^2,x^3,x^4,x^5,x^6,x^7,x^8,x^9,x^10,Y)    # columns are auto-named x, x.2, ..., x.10
reg.fit=regsubsets(Y~.,new_x)
reg.summary=summary(reg.fit)
par(mfrow=c(3,1))
# plot each selection criterion against the number of variables
plot(reg.summary$cp,xlab = 'number of variables',ylab='cp',type='l')
plot(reg.summary$bic,xlab = 'number of variables',ylab='bic',type='l')
plot(reg.summary$adjr2,xlab = 'number of variables',ylab='adjr2',type='l')
number=c(which.min(reg.summary$cp),which.min(reg.summary$bic),which.max(reg.summary$adjr2))
number
## [1] 4 3 4
coef(reg.fit,4)
## (Intercept) x x.2 x.3 x.5
## 1.07200775 2.38745596 2.84575641 3.55797426 0.08072292
coef(reg.fit,3)
## (Intercept) x x.2 x.3
## 1.061507 1.975280 2.876209 4.017639
coef(reg.fit,4)
## (Intercept) x x.2 x.3 x.5
## 1.07200775 2.38745596 2.84575641 3.55797426 0.08072292
The best models under the three criteria are not identical: Cp and adjusted R² choose the four-variable model, while BIC chooses the three-variable model, which matches the true cubic form.
reg.fwd=regsubsets(Y~.,new_x,nvmax=10,method='forward')
reg.summary1=summary(reg.fwd)
par(mfrow=c(2,2))
# plot each selection criterion against the number of variables
plot(reg.summary1$cp,xlab = 'number of variables',ylab='cp',type='l')
plot(reg.summary1$bic,xlab = 'number of variables',ylab='bic',type='l')
plot(reg.summary1$adjr2,xlab = 'number of variables',ylab='adjr2',type='l')
number=c(which.min(reg.summary1$cp),which.min(reg.summary1$bic),which.max(reg.summary1$adjr2))
number
## [1] 4 3 4
reg.bwd=regsubsets(Y~.,new_x,nvmax=10,method='backward')
reg.summary2=summary(reg.bwd)
par(mfrow=c(3,1))
# plot each selection criterion against the number of variables
plot(reg.summary2$cp,xlab = 'number of variables',ylab='cp',type='l')
plot(reg.summary2$bic,xlab = 'number of variables',ylab='bic',type='l')
plot(reg.summary2$adjr2,xlab = 'number of variables',ylab='adjr2',type='l')
number=c(which.min(reg.summary2$cp),which.min(reg.summary2$bic),which.max(reg.summary2$adjr2))
number
## [1] 4 3 4
reg.summary1
## Subset selection object
## Call: regsubsets.formula(Y ~ ., new_x, nvmax = 10, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## x FALSE FALSE
## x.2 FALSE FALSE
## x.3 FALSE FALSE
## x.4 FALSE FALSE
## x.5 FALSE FALSE
## x.6 FALSE FALSE
## x.7 FALSE FALSE
## x.8 FALSE FALSE
## x.9 FALSE FALSE
## x.10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
## x x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" " " "*" " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" " " "*" "*" " " " " " " " "
## 6 ( 1 ) "*" "*" "*" " " "*" "*" " " " " "*" " "
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" " " "*" " "
## 8 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
reg.summary2
## Subset selection object
## Call: regsubsets.formula(Y ~ ., new_x, nvmax = 10, method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## x FALSE FALSE
## x.2 FALSE FALSE
## x.3 FALSE FALSE
## x.4 FALSE FALSE
## x.5 FALSE FALSE
## x.6 FALSE FALSE
## x.7 FALSE FALSE
## x.8 FALSE FALSE
## x.9 FALSE FALSE
## x.10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: backward
## x x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" " " "*" " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" " " "*" "*" " " " " " " " "
## 6 ( 1 ) "*" "*" "*" " " "*" "*" " " " " "*" " "
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" " " "*" " "
## 8 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Backward stepwise selects the same models as forward stepwise and best subset selection in (c).
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
grid=10^seq(10,-2,length=100)
xx=model.matrix(Y~.,new_x)[,-1]
cv.out=cv.glmnet(xx,Y,alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
out=glmnet(xx,Y,lambda=grid)
predict(out,type='coefficients',s=bestlam)[1:11,]
## (Intercept) x x.2 x.3 x.4 x.5
## 1.17017645 2.17605114 2.63586627 3.77660851 0.04248355 0.02420237
## x.6 x.7 x.8 x.9 x.10
## 0.00000000 0.00290213 0.00000000 0.00000000 0.00000000
The lasso assigns substantial weights to \(\beta_1\) through \(\beta_3\), close to the true values 2, 3, and 4, while the remaining predictors receive coefficients at or very near zero.
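For reference, one could also inspect the sparser fit at cv.out$lambda.1se (the standard one-standard-error alternative to lambda.min in glmnet), which typically zeros out more of the spurious higher-order terms:

predict(out,type='coefficients',s=cv.out$lambda.1se)[1:11,]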
Y=1+7*x^7+epi    # new response: beta0=1, beta7=7
new_x=data.frame(x,x^2,x^3,x^4,x^5,x^6,x^7,x^8,x^9,x^10,Y)
reg.fit=regsubsets(Y~.,new_x)
reg.summary=summary(reg.fit)
par(mfrow=c(2,2))
# plot each selection criterion against the number of variables
plot(reg.summary$cp,xlab = 'number of variables',ylab='cp',type='l')
plot(reg.summary$bic,xlab = 'number of variables',ylab='bic',type='l')
plot(reg.summary$adjr2,xlab = 'number of variables',ylab='adjr2',type='l')
number=c(which.min(reg.summary$cp),which.min(reg.summary$bic),which.max(reg.summary$adjr2))
number
## [1] 2 1 4
coef(reg.fit,4)
## (Intercept) x x.2 x.3 x.7
## 1.0762524 0.2914016 -0.1617671 -0.2526527 7.0091338
coef(reg.fit,3)
## (Intercept) x.2 x.5 x.7
## 1.0862698 -0.1618472 -0.0599260 7.0133154
coef(reg.fit,4)
## (Intercept) x x.2 x.3 x.7
## 1.0762524 0.2914016 -0.1617671 -0.2526527 7.0091338
xx=model.matrix(Y~.,new_x)[,-1]
cv.out=cv.glmnet(xx,Y,alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
out=glmnet(xx,Y,lambda=grid)
predict(out,type='coefficients',s=bestlam)[1:11,]
## (Intercept) x x.2 x.3 x.4 x.5
## 1.820215 0.000000 0.000000 0.000000 0.000000 0.000000
## x.6 x.7 x.8 x.9 x.10
## 0.000000 6.796694 0.000000 0.000000 0.000000
The lasso selects essentially the correct model, keeping only the \(x^7\) term (plus intercept), whereas the best-subset fits chosen by Cp and adjusted R² include additional spurious terms, as the coefficient printouts above show.
library(ISLR)
train_index=sample(777,600,replace=FALSE)    # random 600/177 train/test split
train=College[train_index,]
test=College[-train_index,]
lm.fit=lm(Apps~.,train)
pre=predict(lm.fit,test)
err.lm=mean((pre-test$Apps)^2)
mean((pre-test$Apps)^2)
## [1] 668476
grid=10^seq(10,-2,length=100)
train=sample(777,600,replace=FALSE)    # note: a fresh random split, different from the one used for the linear model above
test=-train
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.out$lambda.min
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
err.ridge=mean((ridge.pred-y[test])^2)
mean((ridge.pred-y[test])^2)
## [1] 1243058
coef(ridge.mod)[,18]    # coefficients at the 18th grid value of lambda (a very large penalty)
## (Intercept) PrivateYes Accept Enroll Top10perc
## 3.034889e+03 -1.839221e-01 6.799176e-05 1.640162e-04 3.518793e-03
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## 3.338727e-03 3.071288e-05 5.843315e-05 2.492941e-06 2.526107e-05
## Books Personal PhD Terminal S.F.Ratio
## 1.633314e-04 4.586687e-05 4.382170e-03 4.466248e-03 4.325403e-03
## perc.alumni Expend Grad.Rate
## -1.538424e-03 1.020503e-05 1.615674e-03
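The column above corresponds to an arbitrary, very large \(\lambda\) on the grid; the coefficients at the cross-validated \(\lambda\) may be more informative:

predict(ridge.mod,type='coefficients',s=bestlam)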
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
err.lasso=mean((lasso.pred-y[test])^2)
mean((lasso.pred-y[test])^2)
## [1] 1122837
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)    # newx is not needed for coefficients
length(lasso.coef[lasso.coef!=0])
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] 15
There are 15 non-zero coefficients (including the intercept).
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit=pcr(Apps~.,data=College,subset=train,scale=TRUE,validation='CV')
validationplot(pcr.fit,val.type='MSEP')
pcr.pred=predict(pcr.fit,x[test,],ncomp=17)
err.pcr=mean((pcr.pred-y[test])^2)
mean((pcr.pred-y[test])^2)
## [1] 1127008
set.seed(1)
pls.fit=plsr(Apps~.,data=College,subset=train,scale=TRUE,validation='CV')
validationplot(pls.fit,val.type='MSEP')
pls.pred=predict(pls.fit,x[test,],ncomp=12)
err.pls=mean((pls.pred-y[test])^2)
mean((pls.pred-y[test])^2)
## [1] 1124830
err=c(err.lm,err.ridge,err.lasso,err.pcr,err.pls)
names(err)=c('lm','ridge','lasso','pcr','pls')
barplot(err)
The lasso attains the lowest test error among the methods fit on the common split, though there is not much difference among ridge, lasso, PCR, and PLS; the linear model's error was computed on a different random split, so it is not directly comparable.
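A scale-free way to read these errors is the test \(R^2\) of each method (a rough comparison, since the lm error came from a different split):

test_r2=1-err/mean((y[test]-mean(y[test]))^2)    # 1 - MSE / Var(y_test)
barplot(test_r2,ylab='test R^2')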